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Abstract — While medical imaging typically provides massive 
amounts of data, the extraction of relevant information for 
predictive diagnosis remains a difficult challenge. Functional 
MRI ifMRI) data, that provide an indirect measure of task- 
related or spontaneous neuronal activity, are classically analyzed 
in a mass-univariate procedure yielding statistical parametric 
maps. This analysis framework disregards some important prin- 
ciples of brain organization: population coding, distributed and 
overlapping representations. Multivariate pattern analysis, i.e., 
the prediction of behavioural variables from brain activation 
patterns better captures this structure. To cope with the high 
dimensionality of the data, the learning method has to be 
regularized. However, the spatial structure of the image is not 
taken into account in standard regularization methods, so that the 
extracted features are often hard to interpret. More informative 
and interpre table results can be obtained with the ^l norm of the 
image gradient, a.k.a. its Total Variation (TV), as regularization. 
We apply for the first time this method to fMRI data, and 
show that TV regularization is well suited to the purpose of 
brain mapping while being a powerful tool for brain decoding. 
Moreover, this article presents the first use of TV regularization 
for classification. 

Index Terms — fMRI; regression; classification; regularization; 
Total Variation; spatial structure 



I. Introduction 

Functional Magnetic Resonance Imaging (or fMRI) has 
been widely used for more than fifteen years for neuro scientific 
and cognitive studies. The analysis of these data largely relies 
on the general linear model (GLM), introduced for functional 
imaging by Friston et al. [ ]. The GLM is a simple yet 
powerful framework for deciding which brain regions exhibit 
a significantly positive task-related effect. This inference, 
also called classical inference, is based on statistical tests 
applied to each voxel separately, yielding significance maps 
{a.k.a. Statistical Parametric Maps - SPMs) for the effects 
under consideration. However, despite its simplicity and the 
accuracy of the SPMs, classical inference suffers from a major 
drawback: it analyzes each voxel separately and consequently 
cannot fully exploit the correlations existing between different 
brain regions to improve the inference. Spatial information is 
only taken into account in testing procedures, e.g. by using 
the cluster size tests in Random Field Theory. 

Correlations between brain activations are likely to arise 
as a consequence of processing in distributed populations of 
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neurons ["^], [ ], [ ]. This is particularly the case in popu- 
lation coding models [ ], [ ]. For the purpose of statistical 
inference, these models suggest that effects that differ between 
experimental conditions are not optimally characterized by 
the effect significance at individual voxels [ ], and that one 
should rather consider the combined information from differ- 
ent voxels/regions of the brain [ ]. Moreover, statistical power 
in the case of classical inference is limited by the multiple 
comparison problem (one statistical test is performed for each 
voxel and the number of comparisons has to be corrected for). 

Recently, the inference of behavioral information or cogni- 
tive states from brain activation images such as those obtained 
with fMRI has emerged as an alternative neuroimaging data 
analysis paradigm [ ], [10], [T]. It can be used to assess 
the specificity of several brain regions for certain cognitive 
or perceptual functions, by evaluating the accuracy of the 
prediction of a behavioral variable of interest - the target 
- based on the activations measured in these regions. This 
inference relies on a prediction function, the accuracy of which 
depends on whether it uses the relevant variables, i.e., the 
correct brain regions. This approach, called inverse inference, 
has some major advantages: 

• As multivariate approach, it is consistent with population 
coding models. Indeed, the neural information, which can 
be encoded by different populations of neurons, can be 
decoded using a pattern of voxels [ ], [ ]. 

• It avoids the multiple comparison issue, as it performs 
only one statistical test (on the predicted behavioral 
variable). In that sense, it can detect significant links 
between image data and target that would not have 
been detected by standard statistical parametric mapping 
procedures [ ]; note however that the statistical inter- 
pretation of these two tests are clearly different. 

• It addresses new challenges, in particular by allowing 
to identify a new stimulus in a large dataset, based on 
already seen stimuli (as visual stimuli [ ], or nouns as- 
sociated with new images [ ]). Moreover, it can be used 
for the more challenging generalization of the prediction 
to unknown high level stimuli [ ], which opens a deeper 
understanding on brain functional organization. 

Many machine learning methods have been applied to 
fMRI activation images. Among them are linear discriminant 
analysis [9], support/relevance vector machines [10], neural 
networks [ ], Lasso [18], elastic net regression [ ], kernel 
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ridge regression [20], boosting [ ], sparse logistic regression 
[22], [ J or Bayesian regularization [ ], [ ], [ ]. More- 
over, fMRI data are intrinsically smooth, so that their spatial 
structure has to be taken into account. Spatial information has 
thus been considered within the inverse inference framework, 
by using specific priors in a Bayesian framework [ ] or 
by creating spatially informed features [ ]. In the inverse 
inference problem the main objective remains the extraction 
of informative regions within the brain volume (see [ ] for a 
review). Besides prediction accuracy, an even greater challenge 
in brain functional imaging, is the ability of the method to 
provide an interpretable model (see e.g. [ ]). Ultimately, the 
predictive function learned from the data should be as explicit 
as standard statistical mapping results. This double objective 
is addressed by the present contribution. 

In practice, selecting the relevant voxels - called features 
in machine learning - is fundamental in order to achieve 
accurate prediction. However, when the number of features 
(voxels) is much larger than the numbers of samples (images), 
the prediction method may overfit the training set. In other 
words, it fits seemingly predictive information from noise in 
the training set, and thus does not generalize well to new 
data. To address this issue, one can reduce the number of 
features. A classical strategy consists of preceding the learning 
algorithm with a feature selection procedure that drastically 
reduces the spatial support of predictive regions. To date, 
the most widely used method for feature selection is voxel- 
based Anova (Analysis of Variance), that evaluates each voxel 
independently. This is often combined with the use of Support 
Vector Machine as prediction function (see [29], [10], [ ], 
[31], [ ]). An alternative approach consists in performing 
the model estimation by taking the high dimensional data 
as input while using relevant regularization methods. These 
regularizations are performed with two possible goals: stabi- 
lizing the estimation of the weights of the features, and/or 
forcing a majority of features to have close to zero weights 
{i.e. promoting sparsity). 

Let us introduce the following predictive linear model: 

y = /(X,w,6) = F(Xw + 6) , (1) 

where y represents the behavioral variable and (w, 6) are 
the parameters to be estimated on a training set. A vector 
w G can be seen as an image; p is the number of features 
(or voxels) and 6 G M is called the intercept. The matrix 
X G M^^^ is the design matrix. Each row is a p-dimensional 
sample, i.e., an activation map related to the observation. It 
has been shown [ ], [ ] that using a non-linear classifier does 
not improve the prediction accuracy, and yields interpretation 
issues. Thus, we only focus on linear classifiers in this paper. 
Depending on whether the variable to be predicted takes scalar 
or discrete values, the learning problem is either a regression 
or a classification problem. In a linear regression setting, / 
reads: 

/(X,w,6) = Xw + 6 , (2) 

with y G M"^. In the case of classification with a linear model, 
/ is defined by: 

/(X,w,6) = sign(Xw + 6) , (3) 



where "sign" denotes the sign function and y G { — 1,1}^. 

The crucial issue here is that n <C p, so that estimating 
w is an ill-posed problem. The estimation requires therefore 
adapted regularization. A standard approach to perform the 
estimation of w with regularization uses penalization of a 
maximum likelihood estimator. It leads to the following min- 
imization problem: 

w = argmin C{y, F(Xw + h)) + A J(w) , A > (4) 

where AJ(w) is the regularization term and £(y, F(Xw + 6)) 
is the loss function. The parameter A balances the loss function 
and the penalty J(v^). Note that the intercept h is not included 
in the regularization term. 

The use of the intercept is fundamental in practice as it 
allows the separating hyperplane to be offset from 0. However 
for the sake of simplicity in the presentation of the method, 
we will from now on consider h as an added coefficient in 
the vector w. This is classically done by concatenation of a 
column filled with 1 in the matrix X. The loss function will 
also be abbreviated >C(w). 

In the formalism of (4), the reference method is elastic net 
[jj], which is a combined and ^2 penalization: 

p 

AJ(w) = Ai||w||i + A2||w||^ = ^ Ail^.l + A2^,' (5) 

i=l 

Elastic net has two limit cases: A2 = is the Lasso [34] which 
yields an extreme sparsity in the selected features, and Ai = 
corresponds to Ridge regression [35]. 

A major limitation of the methods cited above, including 
the latter penalization, is that they do not take into account 
the underlying structure of w. In the case of brain images, w 
is defined on a spatial 3 -dimensional grid. The main motivation 
for using this spatial structure is that the predictive information 
is most likely organized in regions, and not randomly spread 
across voxels [36], [28]. As it is demonstrated in this contri- 
bution, one can both decrease the complexity of the results 
{i.e increase the interpretahility of the results by extracting a 
small set of spatially coherent regions of interest) as well as 
increase the accuracy of the prediction by taking into account 
the spatial relations between voxels. 

In this article, we develop an approach for regularized pre- 
diction based on Total Variation (TV), J(w) = TV{v^). TV, 
mathematically defined as the ii norm of the image gradient, 
has been primarily used for image denoising [ ], [ ] as 
it preserves edges. The motivation for using TV for brain 
imaging is that it promotes estimates w of w with a block 
structure, creating regions with piecewise constant weights, 
and therefore outlining the brain regions correlated to the target 
behavioral variable. Indeed, we are expecting that the spatial 
layout of the neural code is sparse and spatially structured in 
the sense that non-zero weights are grouped into connected 
clusters. Weighted maps showing such characteristics will 
be called interpretable, as they fulfill our hypothesis on the 
spatial layout of neural coding [ ]. This approach is closely 
related to the one developed in [ ], that introduce proximity 
information about the features in the regularization term. 
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In this contribution, the mathematical and implementation 
details of TV regression/classification are first detailed. As far 
as we know, the present work is the first to use TV in the 
context of image classification and also the first one to propose 
the use of the image structure in the learning framework of 
(4) in the context of fMRI inverse inference. We apply both 
TV regression and TV classification to an fMRI paradigm that 
studied the processing of object shape and size in the human 
brain. Results show that TV outperforms other state of the 
art methods, as it yields better prediction performance while 
providing weights w with an interpretable spatial structure. 

II. Total Variation and prediction 

We first detail the notations of the problem. We then 
develop the TV regularization framework. Finally, we detail 
the algorithm used for regression and classification. 

A. Notations 

Let us define 1] C the 3D image domain, discretized on 
a finite grid. The coefficients w define a function from Vt to 
R, i.e., w : ^] ^ R. Its TV reads: 

T^w) = ^llVwIIH 

= ^V^w{ujf + Vyw(u;)2 + V,w(u;)2 

Let us assume that uj stands for the voxel at position (z, j, /c), 
away from the border of Vt, then Vx^{^)'^ corresponds to 
(wi+i^j^/e — w^^j^/c)^ (see appendix A for more details). TV can 
be used with different discretizations, such as an anisotropic 
discretization. However, such a discretization is biased in the 
direction of the axes of the image, which is problematic 
especially with a strong regularization. Indeed, an isotropic 
discretization promotes sparse gradient along the image axes. 
We use therefore the standard isotropic discretization of 
TV [38], [41]. 

We denote y G R^ the targets to be predicted, and X G 
]^nxp q£ activation images related to the presentation 

of different stimuli. The integer p is the number of voxels and 
n the number of samples (images). Typically, p ~ 10^ to 10^ 
(for a whole volume), while n~ 10 to 10^. We denote M the 
mask of the brain that comes from standard fMRI analysis, and 
that is used to avoid computation outside of the brain volume. 
M is a Pi X pj X pk three dimensional grid, with: 

Mij^k = 1 if the voxel is in the mask 
Mi,j,k = if the voxel is not in the mask 

with ^ . J ^ Mij^k = P- Additionally, we define grad : 
R{n) a gradient operator and div : R^{n) R{n) 

the associated adjoint divergence operator (the adjoint operator 
is used in the convex optimization algorithm, see appendix A 
for more details, in particular Eq. 13). 
Let K the convex set defined by: 

K = {g '.n^R^ \ yuj eft, \\g{uj)\\ < 1} 



and Uk the projection operator onto the set K: 

UK{g){u;) = g{u) if ||^(a;)||<l 

UK{g){uj) = otherwise. 

This projection operator will be used in the optimization loop 
solving Eq. 2, to apply the constraint. It can be viewed as the 
projection on the ioo norm (dual of the £i norm) ball. 

B. Convex optimization 

We consider the minimization problem (4). When J(w) 
is non-smooth (i.e. not differentiable), an analytical solution 
does not exist and the optimization can unfortunately not be 
performed with simple algorithms such as Gradient descent 
and Newton method. This is for example the case with 
J(w) = ||w||i (£i norm a.k.a. Lasso penalty) and with 
J(w) = T]/(w), both of which require advanced optimization 
strategies. 

A recently studied strategy ([ 1-2], [43], [44], [45]) is based 
on iterative procedures involving the computation of proximity 
operators (see def. 1) [ ]. Such approaches are adapted to 
composite problems with both a smooth term and a non- 
smooth term as it is the case here (see [ ] for a recent 
review). In the context of neuroimaging, such optimization 
schemes have been proposed recently in order to solve the 
inverse problem of magneto- and electro-encephalography 
(collectively M/EEG) when considering non £2 priors [ ], 
[49]. 

Definition 1 (Proximity operator). Let J : R^ ^ R be a 

proper convex function. The proximity operator associated 
with J and A G R+ denoted by prox^j : R^ ^ R^ is given 
by: 

prox^jiw) = argmin^ Q l|v - w||^ + AJ(v)^ 

The iterative procedure known as ISTA (Iterative Shrinkage- 
Thresholding Algorithm, a.k.a Forward-Backward iterations) 
[42], [43], is based on the alternate minimization of the loss 
term C(w), by gradient descent, and the penalty J(w), by 
computing a proximity operator. One can show (see appendix 
B for a sketch of the proof), that this can be done in one single 
step by iterating: 

^(fc+i) ^ prox;,^^^ ^w(^) - ^V/:(w(^))^ , (6) 

where ^V>C(w'^^^) is the gradient descent term with a stepsize 
prox;^j/^ is the proximity operator of the penalty and the 
scalar L is an upper bound on the Lipschitz constant Lq of 
the gradient of the loss function. The pseudo code of the ISTA 
procedure is defined in Algo. 1. 

Inspired by previous findings [ ], the FISTA (Fast Iterative 
Shrinkage -Thresholding Algorithm) procedure [45], [50] has 
been developed to speed up the convergence of ISTA. While 
ISTA converges in 0(1/K), FISTA is proven to converge in 
0(1/K^), where K is the number of iterations. The pseudo 
code of the FISTA procedure is given in Algo. 2. The main 
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improvement in FISTA is to compute the next descent direction 
using the previous one. Such an idea is also present in the 
well known conjugate gradient algorithm that uses all previous 
iterates to compute the next descent direction. 



Algorithm 1: ISTA procedure 
Compute the Lipschitz constant Lq of the operator V£. 
InitiaHze w^^) G W 
repeat 

^(/c+i) ^ prox;,^/^ (w(^) - ^V/:(w^^))) 

where L > Lq. 
until convergence', 
return w 



Algorithm 2: FISTA procedure 
Compute the Lipschitz constant Lq of the operator V£. 
Initialize w^^) G W, v^^) = w^^) and h = 1. 
repeat 

w(^) = prox,^/^(^v(^)-iv/:(v(^))) 

i + yrT4t| 

t/c+l - 

V / 

until convergence; 

return w 

Let us introduce now the notion of Duality gap. The duality 
gap is a natural stopping condition for approaches as ISTA 
and FISTA. In practice, if the duality gap is below a value 
e > 0, it guarantees that the solution obtained is e-optimal, i.e., 
that the value of the cost-function reached by the algorithm 
is not greater than e more the globally optimal value. A 
comprehensive presentation of this notion [ ] is beyond the 
scope of this paper, and we now give some details in the 
particular case of the proximity operator prox;^^y known as 
the ROF problem [ ] (named after the authors L. Rudin, S. 
Osher and E. Fatemi) in the image processing literature. 

The computation of prox;^^y and the associated duality gap 
requires the derivation of a Lagrange dual problem [ ]. 

Proposition 2 (prox;^^y Dual problem). A dual problem 
associated with prox^j^y is given by 

z* = argmax — II J/vz + W/AII2 , (7) 

where z is the dual variable that satisfies v* = w + Xdivz*, 
with V* = prox^j.y(w) 

This result is adapted from [^^] (see appendix C for a sketch 
of the proof). The problem (7) is a maximization of a smooth 
concave function over a convex set. As shown in [50], it can 
be solved with the FISTA iterative procedure. The resolution 
of the ROF problem is therefore achieved by solving the dual 
problem. Once z* is obtained, v* = prox;^^y(w) is given by 
V* = w + Adivz*. 



The latter result also gives an estimate of the duality gap. 

Proposition 3 (Duality gap). The duality gap 5 gap associated 
with the ROF problem is given by: 

<^.ap(v) = i||w - v||i + ATF(v) - i(||w||l + llvlll) > , 

(8) 

where the primal variable v is obtained during the iterative 
procedure from the current estimate of the dual variable z 
with V = w + A div z. 

See appendix C for more details. This duality gap will be 
used as a stopping criterion for the FISTA procedure solving 
the ROF problem. At each iteration of the FISTA procedure, 
we will stop the iterative loop if the duality gap is below a 
given threshold e. In practice, e is set to 10~^ x ||w||2 to be 
invariant to the scaling of the data. 

Note that the ROF problem can be also solved using very 
efficient combinatorial optimization methods [ ], when using 
the anisotropic discretization of TV. 

C. Prediction framework 

We now detail the original contribution of this work, that 
is the construction of a predictive framework using the TV 
regularization. For J(w) = TV{w), the global algorithm for 
solving the minimization problem defined in (4) consists in 
a FISTA procedure (resolution of the ROF problem) nested 
inside an ISTA procedure (resolution of the main minimization 
problem). The FISTA procedure is performed at each step of 
ISTA with a warm restart on the dual variable z. We do not 
use FISTA for solving the main minimization problem, as this 
procedure requires an exact proximity operator. The resolution 
of the ROF problem only leads to an e-optimal solution. The 
pseudo-code of the global algorithm for the TV regularization 
is provided in Table 3. 

A difficulty specific to fMRI data is the computation of the 
gradient and divergence over a mask of the brain with correct 
border conditions (see appendix A for details). Moreover, with 
such an irregular domain, the upper bound L for the Lipschitz 
constant of the FISTA procedure also needs to be estimated 
on each input data. To do this we use a power method that 
is classically used to estimate the spectral norm of a linear 
operator, here equal to the Laplacian A : ft ^ Q defined by 
A{uj) = div(grad(co')). 

TV regression: The regression version of the TV is called TV 
regression. In this case, we use the least-squares loss: 

Aw) = ^||y-Xw|p 
V/:(w) = -lX^(y-Xw) 

The Lipschitz constant Lq of the operator V£ is Lq = 
|||X^X|||/n, where |||.||| stands for the spectral norm equal 
to largest singular value. The constant L is set in practice to 
L = l.lLo. 

TV classification: The classification version of the TV is 
called TV classification. This algorithm is based on a logistic 
loss [ ]. We now give the mathematical formulation for the 
binary case with y G { — 1, 1}^. The logistic regression model 
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Algorithm 3: TV regularization solver 



Set maximum number of iterations K (ISTA). 
Set the threshold e on the dual gap (FISTA). 
Set L = l.lLo where Lq is the Lipschitz constant of V£. 
Set L = l.lLo where Lq is the Lipschitz constant of the 
Laplacian operator A : w e R{ft) div(grad(t(;)). 
Initialize z G ]R(r^^) with zeros. 
### ISTA loop ### 
for k = 1 . . .K do 
u = w — -^V£(w) 
### FISTA loop ### 
Initialize Zaux = z, t = 1 
repeat 

Zold Z 

z = Uk 

told = ^ 
t 



(zaux - 3^grad(Lu + Adiv(za^^))^ 



'-'aux 



4t2)/2 
pl(z- 



until (^c/ap(u + \div{z)) < e; 
w = u + Adiv(z) 
return w 



defines the conditional probability of yi given the data as: 

1 



P(^i|xi,w) 



(9) 



The corresponding loss and the loss gradient read: 

'>C(w) = ^ ELi log (l + exp-^^(- 

V£(w) = -iEr=xT^^;^ 

The Lipschitz constant Lq of the operator VC is Lq = 
||X|p/(4n). The classification framework developed in this 
paper treats the binary case with a logistic model, a.k.a., 
binomial model. In our analysis, we expand this framework to 
multiclass classification using a one-versus-one voting heuris- 
tic. The number of classifiers used is (k) x (/c — l)/2, where k 
is the number of classes. The predicted class is then selected 
as the class which yields the highest probability across the 
predictions of all of the classifiers, as defined in (9). Note that 
a multinomial approach could also be used [ ] . However the 
resulting weights w become impossible to interpret, so that the 
multinomial model may not adapted to the applicative context. 
Indeed, with three classes, one gets two hyperplanes from 
which it is hard to draw any neuro scientific conclusions. The 
weights of each binary classifier have a simpler meaning. This 
one-versus-one voting heuristic is the one used in LibSVM 
[54]. 

D. Performance evaluation 

Our method is evaluated with a cross-validation procedure 
that splits the available data into training and validation sets. 
In the following, (X^,y^) are a learning set, (X^,y^) a test 
set and = F(X^w) refers to the predicted target, where w 
is estimated from the training set. 



For regression analysis, the performance of the different 
models is evaluated using the ratio of explained variance: 

t ^ var(y^) - var(y^ -y^) 
var(y^) 

This is the amount of variability in the response that can be 
explained by the model (perfect prediction yields C = 1^ while 
C < if prediction is worse than chance). 

For classification analysis, the performance of the different 
models is evaluated using the classification score denoted n , 
classically defined as: 



^(y*,y*) 



where is the number of samples in the test set, and 5 is 
Kronecker's delta. 

The p-values are computed using a Wilcoxon signed-rank 
test on the prediction score. 

E. Competing methods 

In our experiments, TV regression is compared to different 
state of the art regularization methods: 

• Elastic net regression [ ], that requires setting two 
parameters Ai and A2 (5). In our analyzes, a cross- 
validation procedure within the training set is used 
to optimize these parameters. Here, we use Ai G 
{0.2A,0.1A,0.05A,0.01A}, where A = ||X^y||oo, and 
A2 G {0.1, 0.5, 1., 10., 100.} (Ai and A2 parametrize two 
different types of norm). 

• Support Vector Regression {SVR) with a linear kernel 
[ ], which is the reference method in neuroimaging. The 
C parameter is optimized by cross-validation in the range 
10~^ to 10^ in multiplicative steps of 10. 

TV classification is compared to different state of the art 
classification methods: 

• Sparse multinomial logistic regression (SMLR) classifi- 
cation [ ], that requires a double optimization, for the 
two parameters Ai and A2. A cross-validation procedure 
within the training set is used to optimize these pa- 
rameters. Here, we use Ai G {0.2A, O.IA, 0.05A, O.OIA}, 
where A = ||X^y ||oo, and A2 G {0.1, 0.5, 1., 10., 100.}. 

• Support Vector Classification (SVC) with a linear kernel 
[ ], which is the reference method in neuroimaging. The 
C parameter is optimized by cross-validation in the range 
10~^ to 10^ in multiplicative steps of 10. 

All these methods are used after an Anova-hsiscd feature se- 
lection as this maximizes their performance. Indeed, irrelevant 
features and redundant information can decrease the accuracy 
of a predictor [ ]. The optimal number of voxels is selected 
within the range {50, 100, 250, 500}, through a nested cross- 
validation within the training set. We do not select directly a 
threshold on p-value or cluster size, but rather a number of 
features. Additionally, we check that increasing the range of 
voxels (i.e. adding 2000 in the range of number of selected 
voxels) does not increase the prediction accuracy on our 
datasets. The parameter estimation of the learning function 
is also performed using a nested cross-validation within the 
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training set, and thus, the cross- vaHdation framework is used 
rigorously in all the experiments of this paper. All methods 
are developed in C and used in Python. The implementation 
of Elastic net is based on coordinate descent [ ], while SVR 
and SVC are based on LibSVM [ ]. Methods are used from 
Python via the Scikit-learn open source package [ ]. 

III. Experiments 

A. Details on simulated data 

The simulated data set X consists of n = 100 images (size 
12x12x12 voxels) with a set of four square Regions of 
Interest (ROIs) (size 2 x 2 x 2). We call U the support of the 
ROIs {i.e. the 32 resulting voxels of interest). Each of the four 
ROIs has a fixed weight in {-0.5,0.5,-0.5,0.5}. We call 
Wij^k the weights of the (i, j, k) voxel. The resulting images 
are smoothed with a Gaussian kernel with a standard deviation 
of 2 voxels, to mimic the correlation structure observed in real 
fMRI data. To simulate the spatial variability between images 
(inter- subject variability, movement artifacts in intra- subject 
variability), we define a new support of the ROIs, called IZ 
such as, for each image 50% (randomly chosen) of the 
weights w are set to zero. Thus, we have 1Z C1Z. We simulate 
the target y for the l^^ image as: 



yi = Wij^kXij^k,i + Q 

]th 



(10) 



with the signal in the (z, j, k) voxel of the l^^ image simulated 
as: 

Ar(0, 1) (11) 

and ei ~ A/'(0, 7) is a Gaussian noise with standard deviation 
7 > 0. We choose 7 in order to have a signal-to-noise ratio 
of 5 dB. We compare TV regression cross- validated with 
different values of A in the range {0.01,0.05,0.1,}, with 
the two reference algorithms, elastic net and SVR. All three 
methods are optimized by 4-folds cross-validation in the range 
described below. 

B. Details on real data 

We apply the different methods on a real fMRI dataset 
related to an experiment studying the representation of objects, 
on ten subjects, as detailed in [ ]. During this experiment, ten 
healthy volunteers viewed objects of two categories (each one 
of the two categories used in equal halves of subjects) with 4 
different exemplars each shown in 3 different sizes (yielding 
12 different experimental conditions), with 4 repetitions of 
each stimulus in each of the 6 sessions. We pooled data from 
the 4 repetitions, resulting in a total of n = 72 images by 
subject (one image of each stimulus by session). Functional 
images were acquired on a 3-T MR system with eight-channel 
head coil (Siemens Trio, Erlangen, Germany) as T2*-weighted 
echo-planar image (EPI) volumes. Twenty transverse slices 
were obtained with a repetition time of 2 s (echo time, 
30 ms; flip angle, 70°; 2 x 2 x 2-mm voxels; 0.5-mm gap). 
Realignment, normalization to MNI space, and General Linear 
Model (GLM) fit were performed with the SPM5 software 



(http://www.fil.ion.ucl.ac.uk/spm/software/spm5). The normal- 
ization is the conventional one of SPM (implying affine and 
non-linear transformations) and not the one using unified seg- 
mentation. The normalization parameters are estimated on the 
basis of a whole-head EPI acquired in addition, and are then 
applied to the partial EPI volumes. The data are not smoothed. 
In the GLM, the effect of each of the 12 stimuli convolved 
with a standard hemodynamic response function was modeled 
separately, while accounting for serial autocorrelation with an 
AR(1) model and removing low-frequency drift terms by a 
high-pass filter with a cut-off of 128 s. The GLM is fitted 
separately in each session for each subject, and we used in 
the following analyzes the resulting session-wise parameter 
estimate images the /3-maps are used as rows of X). All the 
analyzes are performed without any prior selection of regions 
of interest, and use the whole acquired volume. 
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Fig. 1. Experiment paradigm for the classification of object in each of 
the category (left) and regression (right) experiments. Each color represents 
the stimuli which are pooled together in one of the three experiments 
(classification category 1, classification category 2 and regression). 

Regression experiments: First, we perform an intra- subject 
regression analysis. The four different shapes of objects (for 
the two categories) were pooled across for each one of the 
three sizes, and we are interested in finding discriminative 
information between sizes. This reduces to a regression prob- 
lem, in which our goal is to predict a simple scalar factor 
(size of an object) (see Fig. 1). Each subject is evaluated 
independently, in a 12-fold cross-validation. The dimensions 
of the real data set for one subject are p ~ 7x10^ and n = 72 
(divided in 3 different sizes, 24 images per size). We evaluate 
the performance of the method by a leave-one-condition-out 
cross-validation {i.e., leave-6-images-out), and doing so the 
GLM is performed separately for the training and test sets. 
The parameters of the reference methods are optimized with 
a nested leave-one-condition-out cross-validation within the 
training set, in the ranges given before. 

Additionally, we perform an inter-subject regression analy- 
sis on the sizes. The inter-subject analysis relies on subject- 
specific fixed-effects activations, i.e. for each condition, the 6 
activation maps corresponding to the 6 sessions are averaged 
together. This yields a total of 12 images per subject, one 
for each experimental condition. The dimensions of the real 
data set are p ~ 7 x 10^ and n = 120 (divided in 3 
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different sizes). We evaluate the performance of the method 
by cross-vaHdation (leave-one- subject-out). The parameters of 
the reference methods are optimized with a nested leave- 
one-subject-out cross-validation within the training set, in the 
ranges given before. 

Classification experiments: We evaluate the performance 
on a second type of discrimination which is object classifica- 
tion (see Fig. 1). In that case, we averaged the images for the 
three sizes and we are interested in discriminating between 
individual object exemplars/shapes. For each of the two cate- 
gories, this can be handled as a classification problem, where 
we aim at predicting the shape of an object corresponding to a 
new fMRI scan. In order to investigate the performance of TV 
classification, which is an original contribution, we perform 
an inter-subject analysis in the same way as described for the 
regression study, except that now, we perform two analyzes 
corresponding to the two categories used, each one including 
5 subjects. 

Statistical Parametric Maps: For comparison purposes, 
the corresponding maps of Anova {F -score), or SPMs, for 
the inter-subject analysis are given Fig. 2, for the represen- 
tation of sizes (top) and representation of objects for the two 
categories (middle and bottom). As expected, the sizes are 
mostly processed in primary visual cortex, while for objects, 
discrimination is additionally observed in lateral occipital 
regions [ ]. 
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Fig. 2. Inter-subject analysis - Maps of Anova (— log(p-values)) for the 
sizes prediction experiment (top) and the objects identifications for category 
1 (middle) and category 2 (bottom). We threshold the p-values higher than 
10~2 (i.e. - log(p-values) > 3). 



IV. Results 

A. Results on simulated data 

We compare the different methods on the simulated data, 
see Fig. 3. The true weights (a) and resulting Anova F-scores 
(b) are shown. Only TV regression (e) is able to extract the 



simulated discriminative regions. Elastic net (d) only retrieves 
part of the support of the weights, and yields an overly 
sparse solution. This sparsity pattern obtained with elastic net 
is the one that yields the highest prediction accuracy: one 
could seek a less sparse solution, but this would decrease the 
prediction accuracy. We note that the weights in the primal 
space estimated by SVR (c) are everywhere non-zero and do 
not retrieve the support of the weights. 



B. Sensitivity study on real data 

Before any further analysis on real data, we have performed 
a sensitivity analysis of our model with regards to the param- 
eter A. In the inter-subject analysis for the size regression, 
we compute the cross-validated prediction accuracy for twelve 
different values of A between 10~^ and 0.95. The aim of the 
sensitivity study is to assess the stability of the prediction with 
respect to the regularization parameter. The results, detailed in 
Fig. 4, are extremely stable with respect to A in the range 
[5.10~^, 5.10"^]. For this reason, we can fix A = 0.05 in 
the following analyzes. The value of A is the same for all 
the experiments, in both classification and regression settings. 
The correct way of choosing the regularization parameter 
is to embed the TV regularization within an internal cross- 
validation on the training set. However, such approach can be 
computationally costly. 




Fig. 4. Explained variance C, for different values of A, in the inter- subjects 
regression analysis. The accuracy is very stable regarding to A in the range 
[5.10-4,5.10-1]. 



C. Results for regression analysis 

In a first set of analyzes, we assess the performance of TV 
regression in both intra-subject and inter-subject cases, where 
the aim is to predict the size of an object seen by the subject 
during the experiment. 

Intra-subject analysis: The results obtained by the three 
methods are given in Tab. I. TV regression outperforms 
the two alternative methods, yielding an average explained 
variance of 0.92 across the subjects. The difference with SVR 
is significant, but not with elastic net. Moreover, the results 
of the regularized methods {TV, elastic net) are more stable 
(standard deviation three times smaller) across subjects, than 
the results of the SVR. 
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Fig. 3. Two-dimensional slices of the three-dimensional volume of simulated data (top), and weights found on the diagonal (green squares) of the first 
two-dimensional slice (bottom). Comparisons of the weights found by different methods, with the true target (a), and the F-score found by Anova (b). The 
TV method (e) retrieves the true weights. The reference methods ((c), (d)) yield less accurate maps. Indeed, the support of the weights found by elastic net 
is too sparse and does not yield convex regions. SVR yields smooth maps that do not look like the ground truth. 



Methods 


mean C, 


std C 


max C 


min C, 


p-value to TV 


SVR 


0.82 


0.07 


0.9 


0.67 


0.0051 


Elastic net 


0.9 


0.02 


0.93 


0.85 


0.0745 


TV a = 0.05 


0.92 


0.02 


0.95 


0.88 





TABLE I 

Regression - Sizes prediction experiment - Intra-subject analysis. 
Explained variance C for the three different methods. TV 

REGRESSION YIELDS THE BEST PREDICTION ACCURACY, WHILE BEING 
MORE STABLE THAN THE TWO REFERENCE METHODS (STANDARD 
DEVIATION OF C THREE TIMES SMALLER THAN W/?). THE P- VALUES ARE 
COMPUTED ON THE EXPLAINED VARIANCE OF THE DIFFERENT SUBJECTS. 



Inter-subject analysis: The results obtained with the three 
methods are given in Tab. 11. As in the intra-subject analysis, 
TV regression outperforms the two alternative methods, yield- 
ing an average explained variance of 84%, and also more stable 
predictions. Such stability can be illustrated on the subject 3, 
where both reference methods yield poor results, while TV 
regression yields an explained variance 0.2 higher. Moreover, 
we have tested that feature selection minimizes overfitting. 
Indeed, without such feature selection, we obtain a smaller 
explained variance of 76% for SVR and 64% for elastic net. 



Methods 


mean 


std C 


max C 


min 


p-value to TV 


SVR 


0.77 


0.11 


0.97 


0.58 


0.0284 


Elastic net 


0.78 


0.1 


0.97 


0.65 


0.0469 


TV A = 0.05 


0.84 


0.07 


0.97 


0.72 





TABLE II 

Regression - Sizes prediction experiment - Inter-subject analysis. 
Explained variance C for the three different methods. TV 

REGRESSION STILL YIELDS THE BEST PREDICTION ACCURACY, WITH AN 
EXPLAINED VARIANCE 0.06 HIGHER THAN THE BEST REFERENCE METHOD 

{elastic net). The p- values are computed on the explained 

VARIANCE OF THE DIFFERENT SUBJECTS. 



The weighted maps found by the different methods are given 
in Fig. 5. One can notice that, as A increases, the spatial 
support of these maps tends to be aggregated in few clusters 
within the occipital cortex, and that the maps have a nearly 
constant value on these clusters. By contrast, both reference 



methods yield uninterpretable {i.e. more complex) maps, with 
a few informative voxels scattered in the whole occipital 
cortex. The average positions and the sizes of the three main 
clusters found by the TV algorithm, using all the subjects, are 
given Tab. III. TV regression is able to adapt the regularization 
to tiny regions, yielding ROIs from 25 to 193 voxels. The 
clusters are found within the occipital cortex. The majority 
of informative voxels are located in the posterior part of the 
occipital cortex {y < —90 mm), most likely corresponding 
to primary visual cortex, with one additional slightly more 
anterior cluster in posterior lateral occipital cortex. This is 
consistent with the previous findings [ ] where a gradient of 
sensitivity to size was observed across object selective lateral 
occipital ROIs, and the most accurate discrimination of sizes 
in primary visual cortex. 



X (mm) 


y (mm) 


z (mm) 


Sizes (voxels) 


24 


-92 


-16 


25 


-26 


-96 


-10 


103 


16 


-96 


12 


193 



TABLE III 

iNTER-SUBJECT REGRESSION ANALYSIS: POSITIONS AND SIZES OF THE 
THREE MAIN CLUSTERS FOR THE TV REGRESSION METHOD. 



D. Results of classification experiments 

In a second analysis, we assess the performance of TV 
classification in an inter-subject classification analyzes, in 
which the aim is to predict which of 4 object exemplars is 
seen by the different subjects. 

The results (average across the two categories) found by 
the three methods are given in Tab. IV. As in the inter- 
subject regression analysis, the TV-based method outperforms 
the SMLR method. Moreover, it yields an average classification 
score similar to the SVC while being more stable. Seeking 
clusters of activation thus seems a reasonable way to cope with 
inter- subject variability. The average number of selections of 
each voxels within one of the three larger clusters for each 
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Methods 


mean n 


std n 


max K 


min K 


p-value to SVC 


SVC 


48.33 


15.72 


75.0 


25.0 




SMLR 


42.5 


9.46 


58.33 


33.33 


0.0 


TV A = 0.05 


46.67 


11.3 


66.67 


25.0 


1.0 



C = 0.83 




TABLE IV 

Classification - Objects prediction experiment . AVERAGED 
CLASSIFICATION SCORE K, FOR THE THREE DIFFERENT METHODS, ACROSS 
THE TWO CATEGORIES. TV CLASSIFICATION YIELDS SIMILAR 
PREDICTION ACCURACY THAN THE REFERENCE METHOD SVC. THE 
P-VALUES ARE COMPUTED ON THE CLASSIFICATION SCORE OF THE 
DIFFERENT SUBJECTS. 



7=12 ■■- — ^ ^ — * 



C = 0.84 

Fig. 5. Regression - Sizes prediction experiment - Inter-subject analysis. Maps 
of weights found by TV regression for various values of the regularization 
parameter A. When A decreases, the TV regression algorithm creates different 
clusters of weights with constant values. These clusters are easily interpretable, 
compared to voxel-based map (see below). The TV regression algorithm is 
very stable for different values of A, has shown by the explained variance (. 
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Fig. 6. Regression - Sizes prediction experiment - Inter- subject analysis. 
Maps of weights found by the SVR (top) and elastic net (bottom) methods. 
The optimal number of voxels selected by Anova is 500, but elastic net 
further reduces this set to 21 voxels. These voxel-based methods do not yield 
interpretable features (especially when compared to TV regression), which is 
due to the fact that they do not consider the spatial structure of the image. 



one-versus-one map are given Fig. 7. The informative clusters 
are more anterior and more ventral than the ones found within 
the sizes prediction paradigm. We thus confirm the results 
found by classical brain mapping approach, such as Anova 
(see Fig. 2), while providing a classification score based on 
cross-validation on independent data which allows to check the 
actual implication of these regions in the cognitive process. 




Fig. 7. Inter-subject analysis. Top - voxels selected within one of the three 
main clusters by TV regression, for the Sizes prediction experiment. Bottom - 
voxels selected at least one time within one of the three main clusters for each 
of the one-vs-one TV classification, for the Objects prediction experiment. 
Some clusters found in the Objects prediction experiment (y = —40 mm, 
y = —74 mm) are more anterior than the ones found for the Sizes prediction 
experiment (y = —92 mm, y = —96 mm). This is coherent with the 
hypothesis that the processing of shapes is done at a higher level in the 
processing of visual information, and thus the implied regions are found 
further in the ventral pathway. 



V. Discussion 

In this article, we present the first use of TV regularization 
for brain decoding. This method outperforms the reference 
methods on prediction accuracy, and yields sparse brain maps 
with clear informative foci. 

Moreover, with regard to the classification paradigm, we 
integrate the TV in a logistic regression framework. This 
approach, which to our knowledge, has not been used before, 
yields high prediction accuracy, and seems to be a promising 
method for more machine learning problems beyond the scope 
of neuroimaging. 

One major advantage of the proposed method is that, in the 
case of a multi- subject studies, considering extended regions is 
expected to compensate for spatial misalignment, hence it can 
better generalize than voxel-based methods. As proven on both 
inter-subject analyzes, the proposed TV approach yields signif- 
icantly higher prediction accuracy than reference voxel-based 
methods. In addition, the proposed approach yields weight 
maps very similar to the maps obtained by a classical brain 
mapping approach (such as Anova). We note that the solution 
found by our method has a sparse block structure and is suffi- 
cient for good prediction accuracy, which explains the fact that 
the regions observed may be more compactly localized than 
the ones from Anova. Thus, the TV approach has the assets of 
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a predictive framework, while leading to accurate brain maps. 
It is important to notice that, even if TV does not promote 
a strict sparsity of the weights, most voxels are associated 
with very small weights, and only a few clusters get high 
weightings. Moreover, TV regression allows to consider the 
whole brain in the analysis, without requiring any prior feature 
selection. As many accurate dimension reduction approaches 
such as Recursive Feature Elimination [^^] can be extremely 
costly in computational time, avoiding this step is a major 
asset. An important feature of our implementation is thus that 
it reduces computation time to a reasonable amount, so that 
it is not significantly more costly than SVR or elastic net in 
practical settings {i.e., including the cross-validation loops). In 
the inter-subject regression analysis, the average computational 
time is 185 seconds for TV regression, 131 seconds for Anova 
+ SVR and 121 seconds for Anova + elastic net, on a Intel(R) 
Xeon(R) CPU at 2.83GHz. Regularization of the voxel weights 
significantly increases the generalization ability in regression 
problems, by performing feature selection and training of the 
prediction function jointly. However, to date, regularization has 
most often been performed without using the spatial structure 
of the images. By applying a penalization on the gradient of 
the weight and thus taking into account the spatial structure 
of the image-based information, our approach performs an 
adaptive and efficient regularization, while creating sparse 
weight maps with regions of quasi constant weights. TV 
regularization method fulfills thus the two requirements that 
make it suitable for neuroimaging brain mapping: a good 
prediction accuracy (better than the reference methods for 
regression experiments, and equal for classification), and a 
set of interpretable features, made of clusters of similarly- 
tuned voxels. In that sense, it can be seen as the first method 
for performing a large scale multivariate brain mapping (the 
searchlight [ il] only consider the multivariate information in 
a small neighborhood). 

From a neuro scientific point of view, the regions extracted 
from the whole analysis volume in the size discrimination task 
are concentrated in the early visual cortex. This is consistent 
with the fact that early visual cortex yields highly reliable 
signals that are discriminative of feature/shape differences 
between object exemplars, which holds as long as no high- 
level generalization across images is required (see e.g. [ ] 
and [59]). This is expected given the small receptive fields of 
neurons in these regions that will reliably detect differences 
in the spatial envelop or other low-level structure of the 
images. Most importantly, the predictive spatial pattern is 
stable enough across individuals to make reliable predictions in 
new subjects. In fact our method compares best with regards 
to the state of the art in the inter- subjects setting, because 
it selects predictive regions that are not very sensitive to 
anatomo-functional variability. In the object discrimination 
task, the clusters found by our approach are also in the visual 
cortex, but including more anterior ones (probably correspond- 
ing to posterior lateral occipital region) compared to size 
discrimination, which is consistent with the fact that shape 
discrimination requires intermediate/higher level visual areas. 
The finding that also large parts of early visual cortex were 
discriminative here is explained by the fact that generalization 



across viewing conditions was not a part of the analysis and 
classification can therefore be driven by lower-level features. 
However, even if similar maps as the ones found by our 
method can be obtained using Anova, they do not provide 
a prediction score for generalization to independent data (i.e. 
a global measure of the involvement of the regions in the 
cognitive process). 

VI. Conclusion 

In this paper we introduce TV regularization for extracting 
information from brain images, both in regression or classifi- 
cation settings. Feature selection and model estimation are per- 
formed jointly and capture the predictive information present 
in the data better than alternative methods. A particularly im- 
portant property of this approach is its ability to create spatially 
coherent regions with similar weights, yielding simplified and 
still informative sets of features. Experimental results show 
that this algorithm performs well on real data, and is far more 
accurate than voxel-based reference methods for multi- subject 
analysis. In particular, the segmented regions are robust to 
inter-subject variability. These observations demonstrate that 
TV regularization is a powerful tool for understanding brain 
activity and spatial mapping of cognitive process, and is the 
first method that is able to derive statistical weight maps, as 
in the standard SPM approach, within the inverse inference 
framework. 
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Appendix A 
Gradient and Divergence 

The gradient operator which has to be computed on a mask 



in our case (mask of the brain). With / G 
it is defined by: 



Pi Xpj Xpk 



an image, 



(grad /) 




Otherwise 



(grad /);• 



i'ljik 







if Mi^j^k ^i,j+i,/c 1 
otherwise 



(grad /) 







otherwise 



The divergence operator for a gradient p is defined by: 



(div p) 



+ < 



l^i,j,k Pi-l,j,k 

^ij,k 

~Pi-l,j,k 



if Mi 



J,k 



Mi. 



1 



■ PiJ-l,k 



+ < 



Pi,j-l,k 

Pijyk ~ Ptj,k-1 

Pi,j,k 

~Pi,j,k-l 



if Mij^k + Mi_ij^k = 
if Mij^k + Mi_ij^k = 1 

if Mij^k = Mij_i^k = 1 
if M,,,- ^ M,,,_i,fc = 
if Mij^k ^ = 1 

if Mij^k = Mij^k-i = 1 
if Mij^k + Mi,i,/c-i = 
if M^,^- 7^ Mi,^- fe-i = 1 



Appendix B 
ista procedure 

We give the sketch of proof of (6). The loss >C(w) being 
differentiable, the second-order Hnearization of >C(w) reads: 



^(w) 



£(w('=)) + (w - wW)J V£(wW) 

- w — w (^))^v^/:(w(^))(w-w(^)) 



With I/O the Lipschitz constant of V£, we have: 

l|v/:(w) - v/:(w(^))|p < Lo||w - w(^) f 

Using [ ], we obtain: 



w*^^+^^ = argmin /^(w*^^^) H ||w - w 

w 2 



+ AJ(w) 
+ w — w 



Ignoring constant terms, this can be rewritten as: 



Appendix C 

Dual problem and duality gap computation 

We give the sketch of proofs of propositions 2 and 3. We 
recall [51] that the duality between the i\ norm and the ^oo 
norm yields: 

TF(v) = ||Vv||i = max (Vv, z) (12) 

||z||oo<l 

and that the adjoint relation between the gradient and the 
divergence operator reads: 



^(/c+i) _ argjjiin -||w — (w' 
w 2 



(Vv,z) = — (v,divz) 
Using (12) and (13), we minimize: 

mmQ||w-v||^ + ATF(v; 



(13) 



L 



(/c)> 



P + -AJ(w) 



= Amin — v||2 + max (Vv, z) 

V \2X ||z||cx.<l 

^ ^u^^^. ( TT^IIw- v||^ + (Vv,z) 

l|z||oo<i V ^ V2A 

= ^n^^'S. (wtW'^-^WI - (V,divz) 

z oo<i V V V2A 



The computation of the minimum and the maximum above 
can be exchanged because the optimization over v is convex 
and the optimization over z is concave [ ]. 
By setting the derivative with respect to v to one gets the 
resulting solution of the minimization problem over v: 

min f — - llw — vllo — (v, divz) ) =^ v*=w + Adivz 
V \2X J 

Replacing v by v* in the previous expression leads to: 
mmQ||w-v||^ + ATF(v) 



= ^n^^^-, 7Tl|divz||2 - (w,divz) - A||divz||2 

z Lo<l \ ^ 



^„^^^. -TTl|divz||2 - (w,divz) 

Z oo<l \ ^ 



= 9 n^^^, (-^'l|divz||2 - 2A(w,divz)) 

^ l|z|U<l 

= ^ max (||w||2- ||Adivz + w||2) 

^ ||z||oo<l 

This gives the proof of Prop. 2. Also, given a variable z satisfy- 
ing II z Hoc < 1 and an associated w such that v = w + Adivz, 
one can guarantee that 

i||w-v||^ + ArT/(v)>l(||w||i-||v||i) 

The strict convexity of the problem guarantees that, at the 
optimum, the equality holds. This last derivation proves the 
proposition 3. 



where L > Lq [ ]. Finally, using definition (1) of the 
proximity operator for J(w), this is equivalent to (6): 
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(fc+i) 



V£(w(^)) 



